feat: GPU-accelerated SVD and quaternion conversion for ~4x speedup #72

Mozoloa · 2026-01-17T16:04:31Z

This PR adds optional GPU acceleration to the covariance matrix decomposition and rotation-to-quaternion conversion, providing significant performance improvements for CUDA users while maintaining backward compatibility.

Changes

`linalg.py`

Add `use_gpu` parameter to `quaternions_from_rotation_matrices()` (default: `True`)
Add pure PyTorch GPU implementation using Shepperd's method
Original scipy CPU implementation preserved as fallback
~300x faster for large batches (2M+ gaussians)
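
For readers unfamiliar with Shepperd's method, here is a minimal scalar sketch in plain Python. It is an illustration only, not the code in this PR: the PR's version is batched and written in PyTorch, while this one handles a single 3x3 matrix. The function name and the `(w, x, y, z)` output order are assumptions made for the example (SciPy's `Rotation.as_quat()` returns scalar-last by default). The idea is to pick the largest of the trace and the three diagonal entries, so the divisor stays well away from zero.

```python
import math


def quaternion_from_rotation_matrix(m):
    """Return a unit quaternion (w, x, y, z) for a 3x3 rotation matrix.

    Hypothetical helper for illustration; `m` is a nested list of floats.
    """
    trace = m[0][0] + m[1][1] + m[2][2]
    # Shepperd's method: branch on the largest of (trace, m00, m11, m22).
    candidates = [trace, m[0][0], m[1][1], m[2][2]]
    best = max(range(4), key=lambda i: candidates[i])

    if best == 0:
        s = 2.0 * math.sqrt(1.0 + trace)  # s = 4w
        w = 0.25 * s
        x = (m[2][1] - m[1][2]) / s
        y = (m[0][2] - m[2][0]) / s
        z = (m[1][0] - m[0][1]) / s
    elif best == 1:
        s = 2.0 * math.sqrt(1.0 + m[0][0] - m[1][1] - m[2][2])  # s = 4x
        w = (m[2][1] - m[1][2]) / s
        x = 0.25 * s
        y = (m[0][1] + m[1][0]) / s
        z = (m[0][2] + m[2][0]) / s
    elif best == 2:
        s = 2.0 * math.sqrt(1.0 + m[1][1] - m[0][0] - m[2][2])  # s = 4y
        w = (m[0][2] - m[2][0]) / s
        x = (m[0][1] + m[1][0]) / s
        y = 0.25 * s
        z = (m[1][2] + m[2][1]) / s
    else:
        s = 2.0 * math.sqrt(1.0 + m[2][2] - m[0][0] - m[1][1])  # s = 4z
        w = (m[1][0] - m[0][1]) / s
        x = (m[0][2] + m[2][0]) / s
        y = (m[1][2] + m[2][1]) / s
        z = 0.25 * s
    return (w, x, y, z)
```

A batched GPU version would typically compute all four branches for every matrix and select per row with a mask, which avoids Python-level branching and keeps the data on the device.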

`gaussians.py`

Add `use_gpu` parameter to `decompose_covariance_matrices()` (default: `True`)
GPU path: SVD on GPU + vectorized reflection correction
CPU path: original float64 behavior preserved for maximum precision
Automatic fallback to CPU if GPU SVD fails
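
The reflection correction is needed because the orthogonal factor returned by an SVD can have determinant -1, which is a reflection rather than a rotation and cannot be converted to a quaternion. The sketch below shows one common fix on a single matrix in plain Python: flip the sign of one column when the determinant is negative. It is an assumption about the general technique, not a copy of the PR's code, and the helper names are hypothetical. Because the scales enter the covariance squared, flipping a column leaves the reconstructed covariance unchanged.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (
        m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
    )


def fix_reflection(u):
    """Return a copy of orthogonal matrix `u` with determinant +1.

    If det(u) is negative, the last column is negated.
    """
    sign = 1.0 if det3(u) >= 0 else -1.0
    return [[row[0], row[1], sign * row[2]] for row in u]


def reconstruct_covariance(u, scales):
    """Compute u @ diag(scales**2) @ u.T for 3x3 `u` and 3 scales."""
    return [
        [sum(u[i][k] * scales[k] ** 2 * u[j][k] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]
```

In a batched PyTorch setting the same idea can be expressed without a loop, for example by computing `torch.linalg.det` over the batch and multiplying one column by the resulting sign.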

Performance

Tested on an RTX 4090 with ~700k gaussians per frame:

Before: ~4.0s per frame (of which 3s was quaternion conversion on the CPU)
After: ~1.0s per frame
~4x overall speedup

The bottleneck was `scipy.spatial.transform.Rotation.from_matrix()`, which requires a transfer to the CPU and a conversion to NumPy. The new GPU implementation stays entirely on the device.

Backward Compatibility

Default behavior unchanged for CPU tensors
Set `use_gpu=False` to force the original CPU behavior
API is fully backward compatible (new parameter has default value)
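
The dispatch behavior described in this PR (GPU path only for CUDA tensors, an opt-out flag, and a fallback when the GPU path fails) can be sketched as follows. This is a simplified illustration, not the PR's code: `run_with_fallback`, `gpu_fn`, and `cpu_fn` are hypothetical names, and it assumes a GPU failure surfaces as a `RuntimeError`.

```python
def run_with_fallback(x, gpu_fn, cpu_fn, use_gpu=True):
    """Call gpu_fn(x) when allowed and possible, otherwise cpu_fn(x).

    `x` is expected to expose an `is_cuda` attribute, as torch tensors do.
    """
    if use_gpu and getattr(x, "is_cuda", False):
        try:
            return gpu_fn(x)
        except RuntimeError:
            pass  # GPU path failed; fall through to the CPU path
    return cpu_fn(x)
```

Because `use_gpu` has a default value, existing call sites that do not pass it keep working, and inputs that are not on a CUDA device take the CPU path regardless of the flag.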
